- Background: who am I, what do I work on?
- What are boosted regression trees?
- How to fit and use them in R
- How have I used them in my work
4th April 2016
C:N:P ≈ 106:16:1 (elemental ratio)
Simulate some data
With one predictor a tree can be represented as a piecewise function
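A minimal sketch of that idea in base R, assuming hypothetical toy data (a noisy sine curve, not the data used in the talk). It searches for the single split that minimises the residual sum of squares and then writes the fitted one-split tree out as an explicit piecewise function. The names `candidates`, `split` and `stump` are illustrative only.

```r
# Hypothetical toy data: one predictor, noisy sine curve (not the talk's data)
set.seed(1)
x <- seq(0.01, 10, by = 0.01)
y <- sin(x) + rnorm(length(x), sd = 0.3)

# Candidate split points: midpoints between neighbouring (sorted) x values
candidates <- head(x, -1) + diff(x) / 2

# Residual sum of squares for a split at s, predicting the mean on each side
sse <- sapply(candidates, function(s) {
  left  <- y[x <  s]
  right <- y[x >= s]
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
})
split <- candidates[which.min(sse)]

# The fitted one-split tree (a "stump") written as a piecewise function
stump <- function(x_new) {
  ifelse(x_new < split, mean(y[x < split]), mean(y[x >= split]))
}

# A stump is piecewise constant: one value per side of the split
unique(stump(x))
```

A deeper tree simply adds more break points, so with one predictor it is still a step function, just with more steps.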
Still wiggly
## Source: local data frame [2,000 x 3]
##
##        x    Grp        y
##    (dbl) (fctr)    (dbl)
## 1   0.01      A 4.695773
## 2   0.01      B 3.135812
## 3   0.02      A 5.109820
## 4   0.02      B 3.864737
## 5   0.03      A 4.609181
## 6   0.03      B 3.999034
## 7   0.04      A 5.833630
## 8   0.04      B 4.128766
## 9   0.05      A 5.209733
## 10  0.05      B 4.242548
## ..   ...    ...      ...
interaction.depth = 1 means no interaction!
library(gbm)
brt_id_1 <- gbm(y ~ x + Grp, data = dat2
, n.trees = 1000
, shrinkage = 0.1
, interaction.depth = 1
, cv.folds = 5)
brt_id_1
## gbm(formula = y ~ x + Grp, data = dat2, n.trees = 1000, interaction.depth = 1,
##     shrinkage = 0.1, cv.folds = 5)
## A gradient boosted model with gaussian loss function.
## 1000 iterations were performed.
## The best cross-validation iteration was 471.
## There were 2 predictors of which 2 had non-zero influence.
##
## Summary of cross-validation residuals:
##         0%        25%        50%        75%       100%
## -1.9250430 -0.3359167  0.0943870  0.5125533  2.1765444
##
## Cross-validation pseudo R-squared: 0.519
best_iter <- gbm.perf(brt_id_1)
print(best_iter)
## [1] 471
summary(brt_id_1, n.trees = best_iter)
##     var  rel.inf
## x     x 93.67579
## Grp Grp  6.32421
plot(brt_id_1
, n.trees = best_iter
, i.var = 1:2, layout = c(2, 1))
dat2$y_hat_id_1 <- predict(brt_id_1, n.trees = best_iter)
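Why depth-1 trees cannot capture an interaction can be checked by hand. The sketch below is a hand-rolled boosting loop in base R (squared-error loss, stumps only), not gbm's actual implementation, and it uses hypothetical toy data shaped loosely like `dat2`. Because every stump splits on either `x` or `Grp`, the summed model is additive, so the fitted gap between groups is identical at every `x`. The helpers `stump_num` and `stump_fac` are illustrative names.

```r
# Hypothetical toy data, loosely shaped like dat2 (x, Grp, y); not the talk's data
set.seed(2)
dat <- expand.grid(x = seq(0.1, 10, by = 0.1), Grp = c("A", "B"))
dat$y <- sin(dat$x) + ifelse(dat$Grp == "A", 1, 0) + rnorm(nrow(dat), sd = 0.3)

# Best stump on a numeric predictor: fitted values and SSE for residuals r
stump_num <- function(x, r) {
  best <- list(sse = Inf)
  for (s in sort(unique(x))[-1]) {
    left <- x < s
    pred <- ifelse(left, mean(r[left]), mean(r[!left]))
    sse  <- sum((r - pred)^2)
    if (sse < best$sse) best <- list(sse = sse, pred = pred)
  }
  best
}

# Stump on a two-level factor: the only possible split is A vs B
stump_fac <- function(g, r) {
  pred <- ave(r, g)  # group means of the residuals
  list(sse = sum((r - pred)^2), pred = pred)
}

# Boosting loop: each tree is a stump, so it uses x OR Grp, never both
nu    <- 0.1                              # shrinkage
f_hat <- rep(mean(dat$y), nrow(dat))      # start from the overall mean
for (m in 1:200) {
  r    <- dat$y - f_hat                   # residuals = negative gradient of squared error
  s_x  <- stump_num(dat$x, r)
  s_g  <- stump_fac(dat$Grp, r)
  step <- if (s_x$sse <= s_g$sse) s_x$pred else s_g$pred
  f_hat <- f_hat + nu * step
}

# Additivity check: the fitted A - B gap is the same at every x
gap <- f_hat[dat$Grp == "A"] - f_hat[dat$Grp == "B"]
range(gap)
```

The two ends of `range(gap)` agree up to floating-point error. Allowing two splits per tree lets a tree split on `Grp` and then on `x` within one group, which is what makes a group-specific curve possible.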
brt_id_2 <- gbm(y ~ x + Grp, data = dat2
, n.trees = 1000
, shrinkage = 0.1
, interaction.depth = 2
, cv.folds = 5)
best_iter2 <- gbm.perf(brt_id_2, plot.it = FALSE)
dat2$y_hat_id_2 <- predict(brt_id_2, n.trees = best_iter2)
400+ lakes, > 5000 rows of data
set.seed(07052015)
crs <- parallel::detectCores() / 2
gbm_Total.5 <- gbm(Total ~ Week + log10_TN + log10_TP
+ log10_Depth + Seen.Subtyp
, distribution = list(name="quantile", alpha=0.5)
, n.trees=10000, shrinkage = 0.01
, cv.folds = 5, interaction.depth = 5
, data = boost_dat_padded, n.cores = crs)
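The `distribution = list(name = "quantile", alpha = 0.5)` setting asks for the conditional median rather than the mean. As a rough illustration, assuming the usual quantile ("pinball") loss, the base R sketch below shows that the constant prediction minimising that loss at `alpha = 0.5` is the sample median. The response `y` is hypothetical skewed data, and `pinball` is an illustrative helper, not a gbm function.

```r
# Quantile ("pinball") loss of a constant prediction q for target quantile alpha
pinball <- function(y, q, alpha) {
  mean(ifelse(y >= q, alpha * (y - q), (1 - alpha) * (q - y)))
}

set.seed(3)
y <- rexp(1001)  # hypothetical skewed response; odd n so the median is a data point

# Evaluate the loss with each observed value used as the constant prediction
loss <- sapply(y, function(q) pinball(y, q, alpha = 0.5))
best_const <- y[which.min(loss)]

# The minimiser coincides with the median, not the mean
c(best_const = best_const, median = median(y), mean = mean(y))
```

For skewed responses the median is less sensitive to a few very large values than the mean, which is one reason to prefer this loss over the default gaussian one. Other values of `alpha` target other quantiles in the same way.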
N:P ratio effects emerge from "unstructured" model
R package gbm - others available
There is also "An Introduction to Statistical Learning with Applications in R" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani: http://www-bcf.usc.edu/~gareth/ISL/
fold.id